Abstract

We apply the Minimum Description Length model selection approach to the detection of
extra-solar planets, and use this example to show how specification of the experimental
design affects the prior distribution on the model parameter space and hence the posterior
likelihood which, in turn, determines which model is regarded as most 'correct'. Our analysis
shows how conditioning on the experimental design can render a non-compact parameter space
effectively compact, so that the MDL model selection problem becomes well-defined.
1 Introduction



The Bayesian approach to parametric model selection requires the specification of a prior
probability distribution over the parameter space. The Jeffreys prior, which is proportional
to the square root of the determinant of the Fisher information computed in the parameter
space, has been shown to be the uniform prior over all distributions indexed by the parameters
in a parametric family $p_\Theta$. Geometrically, its integral over a region of the parameter
space computes a volume that essentially measures the fraction of statistically
distinguishable probability distributions within that region [1]. In this interpretation, the
Jeffreys prior distribution

$$\omega(\Theta) = \frac{d^d\Theta \, \sqrt{\det J_{ij}(\Theta)}}{\int d^d\Theta \, \sqrt{\det J_{ij}(\Theta)}} \tag{1}$$

where $\Theta = \{\theta_1, \ldots, \theta_d\}$, simply measures the fractional volume of the
small element $d^d\Theta$ relative to the total volume of the parametric manifold
$V = \int d^d\Theta \, \sqrt{\det J_{ij}(\Theta)}$. Here $J_{ij}$ is the Fisher information on
the parameter space $\Theta \in \mathbb{R}^d$ and $d^d\Theta$ is the standard Riemannian
volume element on $\mathbb{R}^d$. The volume $V$ also appears in the Minimum Description Length
(MDL) approach to model selection [2, 3], conceptually because it effectively measures how
many different distributions are describable by different parameter choices.
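
To make the role of the volume $V$ concrete, the following minimal sketch evaluates it
numerically for two textbook one-parameter families that are not part of our analysis: the
Bernoulli family, whose parameter space is compact, and a Gaussian with unknown mean, whose
parameter space is not. The function name `jeffreys_volume`, the midpoint rule and the
cut-offs are illustrative assumptions.

```python
import math

def jeffreys_volume(sqrt_det_fisher, lo, hi, steps=200_000):
    """Midpoint-rule estimate of V = integral of sqrt(det J) over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(sqrt_det_fisher(lo + (k + 0.5) * h) for k in range(steps))

# Bernoulli family: J(p) = 1 / (p (1 - p)) on the compact interval [0, 1].
# The integrable endpoint singularities give a finite volume (analytically pi).
v_bernoulli = jeffreys_volume(lambda p: 1.0 / math.sqrt(p * (1.0 - p)), 0.0, 1.0)

# Gaussian with unknown mean and unit variance: J(mu) = 1, so the volume is
# just the length of the cut-off interval and diverges as the cut-off grows.
gauss_volumes = [jeffreys_volume(lambda mu: 1.0, -L, L, steps=1000)
                 for L in (1.0, 10.0, 100.0)]
```

The second family shows the difficulty discussed next: the volume is set entirely by the
hand-chosen cut-off.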

An important difficulty in applying the MDL approach to model selection occurs when 
the parameter space is noncompact and the volume V diverges. In this case, from the 
Bayesian perspective, a uniform prior on the parameter space does not exist, while from 
the MDL perspective the number of models that might be describable diverges, leading to 
problems with the definition of the description length. Of course the parameter space can 
be cut off by hand, but unless the choice of cut-off is well founded, it can lead to artifacts 
in the comparison of different model families [4, 5, 6]. Unfortunately in many practical
problems the parameter space is noncompact and V diverges. For example, in astrophysics, 
the detection of exoplanets depends on a model of the light coming from the occluded star. 
This model will contain a non-compact direction representing the orbital period of the planet;
see, e.g., [7]. For examples from psychophysics see, e.g., [1].

In this note we argue that merely specifying the experimental set-up, before the measurement
of any actual data, influences the prior distribution on the parameter space. This occurs
because, given the finite number of measurements in any experiment, many of the probability
distributions indexed by a parametric manifold will be statistically indistinguishable. In
cases where the parameter space is noncompact, the uniform prior conditioned on
the experimental setup can thus become well-defined. In the geometric language of [3], the 
volume that measures the number of probability distributions in the parametric family that 
are statistically distinguishable given a finite number of measurements can be finite even if 
the parameter space is non-compact. In effect, specifying the experimental set-up can render 
the parameter space compact. 

Our results illustrate how the choice of experimental set-up influences the measure on the
parameter space of a model, thereby affecting which model is regarded as most 'correct'. In
section 2 we briefly review the computation of posterior probabilities, and consider the effect
of conditioning on the experimental set-up on the parameter space measure. In section 3 we
apply these considerations to a physical problem: the analysis of light-curves of stars with
orbiting planets. In this example we see that the volume of the parameter space is rendered
effectively finite after the experimental set-up is specified.

2 The effect of experimental design on the parameter space measure

2.1 Review 

Suppose one is interested in some physical phenomenon, and has made $N$ relevant
measurements: $Y = \{y_1, \ldots, y_N\}$. Further suppose that there are two different
parametric models, $A$ and $B$, that aim to describe the phenomenon in question. The basic
question to be answered is which of the two models is the better one, considering the
experimental data $Y$. The probability-theoretic answer to this question is to compute the
posterior probabilities $P(A|Y)$ and $P(B|Y)$, which we can write using Bayes' rule as

$$P(A|Y) = \frac{P(A)}{P(Y)} \int \omega(\Theta)\, P(Y|\Theta), \tag{2}$$

where $\Theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$ is the vector of variables
parametrising $A$, and $\omega(\Theta)$ is the volume form associated to the measure on the
parameter space, which we will define shortly. A corresponding expression can also be written
for $P(B|Y)$. Since we wish to compare $P(A|Y)$ and $P(B|Y)$, we can ignore the common factor
$P(Y)$, and we will assume $P(A) = P(B)$ and drop this factor as well. Thus the only remaining
ingredient to be defined is the volume form $\omega(\Theta)$; we simply quote the result from
[1]: the volume form that gives equal weight to all statistically distinguishable
distributions in the parametric family is

$$\omega(\Theta) = \frac{\sqrt{\det J_{ij}(\Theta)}\; d^d\Theta}{\int d^d\Theta \, \sqrt{\det J_{ij}(\Theta)}}, \tag{3}$$

where $J_{ij}(\Theta)$ is the Fisher information matrix, defined as the second derivative of
the Kullback-Leibler distance $D(\Theta_p \| \Theta_q)$:

$$J_{ij}(\Theta_p) = \partial_{\phi^i} \partial_{\phi^j} D(\Theta_p \| \Theta_p + \Phi)\big|_{\Phi = 0}, \tag{4}$$

$$D(\Theta_p \| \Theta_q) = \int dx\, \Theta_p(x) \ln \frac{\Theta_p(x)}{\Theta_q(x)}, \tag{5}$$

where $dx$ is the integration measure over the sample space $\{x\}$, and $\Theta_p(x)$ is the
distribution function associated to the values of the parameters
$(\theta_p^1, \ldots, \theta_p^d)$. Now we have defined everything needed to compute the
posterior probabilities, and we illustrate the formalism by applying it to the analysis of
light-curves.



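Using this, we can compute the Fisher information matrix by computing the Kullback-Leibler
distance between two nearby points and Taylor expanding. Writing $\Theta \equiv \Theta_p(x)$
and $\partial_i$ for the derivative with respect to the $i$-th parameter,

$$\begin{aligned}
D(\Theta_p \| \Theta_p + \Phi) &= -\int dx\, \Theta \ln \frac{\Theta_{p+\Phi}}{\Theta} \\
&= -\int dx\, \Theta \ln\left(1 + \phi^i \frac{\partial_i \Theta}{\Theta} + \frac{1}{2} \phi^i \phi^j \frac{\partial_i \partial_j \Theta}{\Theta} + \ldots\right) \\
&= -\int dx \left(\phi^i \partial_i \Theta + \frac{1}{2} \phi^i \phi^j \partial_i \partial_j \Theta - \frac{1}{2} \phi^i \phi^j \frac{\partial_i \Theta\, \partial_j \Theta}{\Theta}\right) + O(\phi^3) \\
&= \frac{1}{2} \phi^i \phi^j \int dx\, \frac{1}{\Theta}\, \partial_i \Theta\, \partial_j \Theta + O(\phi^3),
\end{aligned} \tag{6}$$

so that $J_{ij} = \int dx\, \frac{1}{\Theta}\, \partial_i \Theta\, \partial_j \Theta$. On the
third line, the terms linear in $\Theta$ vanish: exchanging the order of integration and
differentiation, the integral of $\Theta$ yields the constant 1, which then differentiates to
zero.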
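
As a quick numerical sanity check of the definition (4), the sketch below differentiates the
Kullback-Leibler distance twice by central differences and compares the result with the known
Fisher information. The Bernoulli family and the helper names are illustrative choices, not
part of our analysis.

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler distance D(p || q) between two Bernoulli distributions."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def fisher_from_kl(p, h=1e-4):
    """Second derivative of D(p || p + phi) at phi = 0 by central differences."""
    # D(p || p) = 0, so the usual three-point stencil reduces to two terms.
    return (kl_bernoulli(p, p + h) + kl_bernoulli(p, p - h)) / h**2

p = 0.3
numeric = fisher_from_kl(p)
analytic = 1.0 / (p * (1.0 - p))   # textbook Fisher information of a Bernoulli
```

The two numbers agree to within the accuracy of the finite-difference step.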

2.2 Effect of the experimental set-up 

The measure (3) is independent of the experimental data $Y$ and is constructed under the
assumption that the entire sample space can be measured by the observer. However, in real
experiments, instrumental and design limitations only allow observation of some subset $M$
of the sample space. Thus an observation either results in no detected outcome, or in a
measurement $y_i \in M$. The effective predicted distribution of measured outcomes is
therefore not $\Theta(y)$, but rather

$$\begin{cases} \Theta(y), & y \in M, \\ \Theta^{\rm out}, & \text{no measured outcome}, \end{cases} \tag{7}$$

where $\Theta^{\rm out} = \int_{y \notin M} dy\, \Theta(y)$. We will argue that if the models
in the asymptotic regions of a noncompact parameter space differ in their predictions mostly
outside the observable region $M$, the Fisher information for the effective distributions (7)
can decay sufficiently quickly to render the volume
$V = \int d^d\Theta \, \sqrt{\det J_{ij}(\Theta)}$ finite. In this section we will give one set
of sufficient conditions for this to happen and in Sec. 3 we will give a detailed example.
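
A toy illustration of this mechanism, which is not the example analysed in Sec. 3: take a
unit-variance Gaussian with unknown mean $\mu$ and suppose only outcomes in the window
$M = [-1, 1]$ can be registered, everything else being lumped into a single 'no measured
outcome' event. The sketch below computes the Fisher information of this effective
distribution; the window, the step counts and the function names are assumptions made for
illustration.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def effective_fisher(mu, a=-1.0, b=1.0, steps=400):
    """Fisher information for mu when only outcomes in M = [a, b] are registered."""
    h = (b - a) / steps
    # In-window part: (d_mu Theta)^2 / Theta = (y - mu)^2 * Theta, integrated over M.
    inside = h * sum((a + (k + 0.5) * h - mu) ** 2 * phi(a + (k + 0.5) * h - mu)
                     for k in range(steps))
    # 'No measured outcome' part: (d_mu Theta_out)^2 / Theta_out.
    theta_out = 1.0 - (Phi(b - mu) - Phi(a - mu))
    d_theta_out = phi(b - mu) - phi(a - mu)
    return inside + d_theta_out ** 2 / theta_out

def volume(cutoff, n=400):
    """Midpoint estimate of the integral of sqrt(J_eff) over [-cutoff, cutoff]."""
    h = 2.0 * cutoff / n
    return h * sum(math.sqrt(effective_fisher(-cutoff + (k + 0.5) * h)) for k in range(n))
```

Without the window the Fisher information is 1 for every $\mu$ and the volume grows linearly
with the cut-off. With the window, `effective_fisher` falls off like a Gaussian tail at large
$|\mu|$, so `volume(10.0)` and `volume(20.0)` are practically equal: the parameter space has
become effectively compact.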

Consider a model, specified by parameters
$\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$, and a distribution
$\Theta_\theta(y)$, with $y \in \mathbb{R}^n$. We will slightly simplify notation by simply
referring to the distribution as $\Theta(y)$ and understanding the implicit parameter
dependence. Let us use spherical coordinates in the parameter space $\mathbb{R}^d$ with $\rho$
being the radial coordinate, i.e.
$(\theta_1, \ldots, \theta_d) \to (\rho, \varphi_1, \ldots, \varphi_{d-1})$. Also consider an
experimental set-up that can only make measurements inside some compact region
$M \subset \mathbb{R}^n$. Thus, the probability of no measurement being registered by this
experiment is $\Theta^{\rm out} = \int_{y \notin M} dy\, \Theta(y)$.

Our first assumption is a smoothness condition, so that inside the region $M$ the
distribution does not fluctuate too much as one approaches the asymptotics of parameter space:

$$\left|\partial_i \Theta(y)\right|_{y \in M} < \delta(\rho), \quad \text{for large } \rho, \quad i = 1, \ldots, d, \tag{8}$$

where $\delta(\rho)$ goes to zero as $\rho$ goes to infinity; we will later specify the exact
scaling needed. Intuitively, this condition says that as the parameter $\rho \to \infty$, the
models do not differ too much inside the observable part of the sample space $M$. This allows
us to estimate
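
$$\left|\partial_i \Theta^{\rm out}\right| = \left|\partial_i \int_{y \in M} dy\, \Theta(y)\right| \le \int_{y \in M} dy\, \left|\partial_i \Theta(y)\right| < \delta(\rho)\, {\rm Vol}(M), \tag{9}$$

where Vol($M$) denotes the volume of the compact region $M$.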

Secondly we assume that inside $M$, the distributions $\Theta(y)$ do not decay too quickly as
$\rho \to \infty$. Intuitively, since any experiment will only measure a finite amount of data
(say $N$ points), if the probability of a single measurement lying inside $M$ is significantly
less than $1/N$, then the experimental set-up will not detect anything. Thus we will require

$$\Theta(y)\big|_{y \in M} > \epsilon(\rho), \quad \text{for large } \rho, \tag{10}$$

where again we will later specify the scaling of $\epsilon(\rho)$ with $\rho$.

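Using these assumptions, we can establish an upper bound for the Fisher information of the
effective distribution (7):

$$J_{ij} = \int_{y \in M} dy\, \frac{1}{\Theta}\, \partial_i \Theta\, \partial_j \Theta + \frac{1}{\Theta^{\rm out}}\, \partial_i \Theta^{\rm out}\, \partial_j \Theta^{\rm out} < \frac{\delta^2}{\epsilon}\, {\rm Vol}(M) + \frac{\delta^2}{\Theta^{\rm out}}\, {\rm Vol}(M)^2. \tag{11}$$

For small $\epsilon$ and $\Theta^{\rm out}$ of order one the first term dominates. Thus the
determinant of the Fisher information scales as

$$\det J \sim \left(\frac{\delta^2}{\epsilon}\right)^d, \tag{12}$$

and, since the volume element in spherical coordinates grows as $\rho^{d-1} d\rho$, for the
integral $V$ to be finite one must have suppression stronger than
$\sqrt{\det J} \sim \rho^{-d}$. Thus the integral converges if $\delta$ is suppressed more
strongly than

$$\delta(\rho) \sim \frac{\sqrt{\epsilon(\rho)}}{\rho}. \tag{13}$$

From the experimental set-up one can estimate how $\epsilon(\rho)$ scales with $\rho$, which
then determines how $\delta(\rho)$ needs to scale for the integral to converge. This is thus a
sufficient condition for rendering the parameter space effectively finite.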

It is worth stressing that, following the above analysis, any method of deciding the validity 
of a model is impacted by the choice of the experiment in a completely computable way, and 
this should be taken into account when designing experiments. 



1 The condition (10) can be relaxed by recognizing that if $\Theta(y)|_{y \in M}$ decays too
quickly as $\rho \to \infty$, then the models in the asymptotic region of the parameter space
make no measurable predictions for experiments designed with a finite number of measurements.
The example in Sec. 3 will illustrate such a scenario.
Figure 1: An example of a light-curve.



3 The probability of exo-planet detection 
3.1 Model for exo-planets 

Consider a star orbited by a planet so that the planet periodically passes between the star 
and Earth. The light output (light-curve) of such a star is a constant line, with a small 
periodic dip when the planet is eclipsing part of the star. One model for such a light-curve 
was proposed in [7] as 



$$y(T, D, \eta, \tau, b; t) = b - \frac{D}{2}\left[\tanh c\left(\bar t + \tfrac{1}{2}\right) - \tanh c\left(\bar t - \tfrac{1}{2}\right)\right], \tag{14}$$

where

$$\bar t = \frac{T \sin\frac{\pi (t - \tau)}{T}}{\pi \eta}. \tag{15}$$

An example light-curve is shown in figure 1. $T$ is the period of the planet; $\eta$ is the
duration of the transit, i.e. how long the planet eclipses the star; $D$ is the depth of the
dip in the curve; $b$ is the total observed brightness of the star; and $\tau$ is a phase
parameter specifying when the planet's transit occurs. Finally, $c$ is a constant parameter
specifying the sharpness of the edges of the light-curve, expected to be fairly large as the
transition between transit/no-transit is relatively quick. The assumption $c \gg 1$ greatly
simplifies our analysis, and is not physically very restrictive.
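
For readers who wish to experiment with the model, a direct transcription of (14) and (15)
as we read them might look as follows. The default sharpness `c = 50` and the parameter
values in the usage lines are arbitrary illustrative choices, not values taken from [7].

```python
import math

def light_curve(t, T, D, eta, tau, b, c=50.0):
    """Brightness at time t for period T, depth D, transit duration eta,
    phase tau and baseline brightness b, following (14) and (15)."""
    t_bar = T * math.sin(math.pi * (t - tau) / T) / (math.pi * eta)
    return b - 0.5 * D * (math.tanh(c * (t_bar + 0.5)) - math.tanh(c * (t_bar - 0.5)))

# Illustrative values: a 1% dip lasting 0.5 time units every 10 time units.
mid_transit = light_curve(2.0, T=10.0, D=0.01, eta=0.5, tau=2.0, b=1.0)
out_of_transit = light_curve(4.5, T=10.0, D=0.01, eta=0.5, tau=2.0, b=1.0)
next_transit = light_curve(12.0, T=10.0, D=0.01, eta=0.5, tau=2.0, b=1.0)
```

In the middle of a transit the curve sits at $b - D$, away from it at $b$, and the dip
repeats one period later.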

The parameter space for this model is clearly non-compact as $T$ can range to infinity.
However, we will argue that the space is effectively rendered compact after the experimental
set-up is specified. To be precise, the parameter space is (see footnote 2):

$$T \in [0, \infty), \quad D \in [0, b], \quad \tau \in [0, T], \quad \eta \in [0, \delta T], \quad b \in [0, b_{\max}], \tag{16}$$

where $\delta$ is a small number that we will estimate, and the maximal brightness
$b_{\max}$ is naturally given by the brightness of Sirius, the brightest star in the night
sky. Assuming a circular orbit as in Figure 2, the ratio of the transit time to the period of
the planet is given by

$$\frac{\eta}{T} = \frac{2r / v_{\rm planet}}{2\pi R / v_{\rm planet}} = \frac{1}{\pi} \frac{r}{R}.$$



2 Note that we consider c to be a constant, not a parameter. 



Figure 2: The basic set-up: an extra-solar planet orbiting a star of radius r with an average 
distance R. 

For the currently known transiting exo-planets this ratio is around 0.1 [8], although for
a typical system one expects it to be smaller, as large planets orbiting close to the star are
easier to observe, which favors larger values of the ratio. For an elliptical orbit, the
answer will differ by an $O(1)$ factor, but will have the same dependence on $r/R$. Thus,
$\eta$ will always be a small fraction of $T$.
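
To get a feeling for the numbers, the sketch below evaluates $\eta/T = r/(\pi R)$ for a
Sun-sized star and two hypothetical circular orbits. The orbit radii are illustrative choices
and are not taken from [8].

```python
import math

SOLAR_RADIUS_M = 6.957e8              # nominal solar radius in metres
ASTRONOMICAL_UNIT_M = 1.495978707e11  # astronomical unit in metres

def transit_fraction(star_radius_m, orbit_radius_m):
    """eta / T = r / (pi R) for a circular orbit seen edge-on."""
    return star_radius_m / (math.pi * orbit_radius_m)

# Hypothetical close-in planet at 0.05 AU and an Earth-like orbit at 1 AU.
close_in = transit_fraction(SOLAR_RADIUS_M, 0.05 * ASTRONOMICAL_UNIT_M)
earth_like = transit_fraction(SOLAR_RADIUS_M, 1.0 * ASTRONOMICAL_UNIT_M)
```

The close-in orbit gives a ratio of a few percent, while the Earth-like orbit gives roughly
one part in a thousand, in line with the expectation that typical systems have smaller ratios
than the ones easiest to detect.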

Now we can write down the probability density for measuring values
$y = (y_1, \ldots, y_N)$ for the light-curve at times $(t_1, \ldots, t_N)$ with the
light-curve specified by parameters
$(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5) = (T, D, \eta, \tau, b)$ as

$$\Theta(y) = \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_k - y_0(\theta; t_k))^2}{2\sigma^2}} = (2\pi\sigma^2)^{-N/2}\, e^{-\frac{1}{2\sigma^2} \sum_{k=1}^{N} (y_k - y_0(\theta; t_k))^2}, \tag{17}$$

where $y_0(\theta; t)$ denotes the model light-curve (14), written simply as $y(\theta; t)$
below. We have assumed that the uncertainty in each measurement is Gaussian, and further we
have chosen the standard deviation $\sigma$ to be equal for all measurements for simplicity.
Using (17) in the formula (4), we see that the integrals in the Fisher information are
Gaussian in $y_k$, thus we can compute them analytically to get

$$J_{ij} = \frac{1}{\sigma^2} \sum_{k=1}^{N} \partial_{\theta_i} y(\theta; t_k)\, \partial_{\theta_j} y(\theta; t_k). \tag{18}$$

This is our key formula, and we shall spend the next subsection analysing its properties.
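
Formula (18) is straightforward to evaluate numerically for any set of measurement times. The
sketch below does so with central finite differences in place of analytic derivatives; the
step size, the sharpness `c = 50`, the sampling times and the parameter values are all
illustrative assumptions.

```python
import math

def light_curve(t, theta, c=50.0):
    """Model light-curve (14)-(15) with theta = (T, D, eta, tau, b)."""
    T, D, eta, tau, b = theta
    t_bar = T * math.sin(math.pi * (t - tau) / T) / (math.pi * eta)
    return b - 0.5 * D * (math.tanh(c * (t_bar + 0.5)) - math.tanh(c * (t_bar - 0.5)))

def gradient(t, theta, rel_step=1e-6):
    """Central-difference derivatives of the light-curve w.r.t. each parameter."""
    grads = []
    for i in range(len(theta)):
        h = rel_step * max(abs(theta[i]), 1.0)
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grads.append((light_curve(t, up) - light_curve(t, down)) / (2.0 * h))
    return grads

def fisher_matrix(times, theta, sigma):
    """J_ij = (1 / sigma^2) * sum_k d_i y(t_k) d_j y(t_k), as in (18)."""
    d = len(theta)
    J = [[0.0] * d for _ in range(d)]
    for t in times:
        g = gradient(t, theta)
        for i in range(d):
            for j in range(d):
                J[i][j] += g[i] * g[j] / sigma**2
    return J

theta = (10.0, 0.01, 0.5, 2.0, 1.0)        # (T, D, eta, tau, b), illustrative
times = [0.1 * k for k in range(100)]      # 100 evenly spaced measurements
J = fisher_matrix(times, theta, sigma=0.001)
```

Since $\partial_b y = 1$ at every time, the entry `J[4][4]` equals $N/\sigma^2$, which is a
convenient check on the implementation.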
3.2 Finiteness of light-curve parameter space 

We now wish to apply the general arguments of section 2 to the exo-planet system. Consider
an experimental set-up that can barely measure two periods, and then consider shortening
the experiment slightly so that only one dip is detected; this is depicted in figure 1. To be
precise, the shorter set-up measures the beginning and end of a transit at $t_1$ and $t_2$,
$n$ points in between, and $m$ points after the transit. The longer set-up makes measurements
at the same times, and additionally at times $t_3$ and $t_4$, detecting the second transit. In
the next subsections we will show that $J_{\rm short} \ll J_{\rm long}$, indicating that
detecting the second dip is of fundamental importance to experimental design; without the
second dip the experimental set-up can't differentiate models with large enough $T$. This
renders the parameter space effectively finite, as an experiment cannot differentiate between
models that have period $T$ larger than the duration of the experiment.

3.2.1 Effect of measuring a second transit on det(J) 

In this subsection we will give an estimate for the magnitude of the determinant of the 
Fisher information, and show how it is affected by the inclusion of the second transit in the 
data. In subsequent subsections we will exactly compute the determinant for a few specific 
experimental set-ups. 

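From (18) and the definition of a determinant, we see that in each term of the determinant
each parameter $\theta_i$ appears exactly twice in the derivatives, i.e. each term is of the
form

$$J_{i_1 j_1} J_{i_2 j_2} J_{i_3 j_3} J_{i_4 j_4} J_{i_5 j_5} \sim \frac{1}{\sigma^{10}}\, \partial_T y\, \partial_T y\, \partial_D y\, \partial_D y\, \partial_\eta y\, \partial_\eta y\, \partial_\tau y\, \partial_\tau y\, \partial_b y\, \partial_b y. \tag{19}$$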

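As a rough estimate of the size of the determinant, we investigate how large terms of this
type can be. With $\bar t$ as defined in (15), the derivatives are

$$\partial_T y(\theta, t) = \frac{c D f(t)}{2 \pi \eta} \left(\sin\frac{\pi (t - \tau)}{T} - \frac{\pi (t - \tau)}{T} \cos\frac{\pi (t - \tau)}{T}\right), \tag{20}$$

$$\partial_D y(\theta, t) = \frac{1}{2} \left(\tanh c\left(\bar t - \tfrac{1}{2}\right) - \tanh c\left(\bar t + \tfrac{1}{2}\right)\right), \tag{21}$$

$$\partial_\tau y(\theta, t) = -\frac{c D f(t)}{2 \eta} \cos\frac{\pi (t - \tau)}{T}, \tag{22}$$

$$\partial_\eta y(\theta, t) = -\frac{c D f(t)}{2 \eta}\, \bar t, \qquad \partial_b y(\theta, t) = 1, \tag{23}$$

$$\text{with} \quad f(t) = \tanh^2 c\left(\bar t + \tfrac{1}{2}\right) - \tanh^2 c\left(\bar t - \tfrac{1}{2}\right). \tag{24}$$

From (24) we see that $f(t) \neq 0$ only when $\bar t \approx \pm\frac{1}{2}$, again assuming
large $c$. This tells us that the measurements that contribute most to the Fisher information
are the ones on the edges of the dips (see footnote 3), i.e. at times $t_1$, $t_2$, $t_3$ and
$t_4$ in figure 1. We write the condition $|\bar t| \approx \frac{1}{2}$ as

$$\left|\frac{T}{\pi \eta} \sin\frac{\pi (t - \tau)}{T}\right| \approx \frac{1}{2}, \tag{25}$$

and note that the ratio of transit time to period is very small, $\eta / T \ll 1$. This gives
us the solutions

$$\frac{t - \tau}{T} \approx n \pm \frac{\eta}{2T}, \tag{26}$$

where $n$ is an integer indexing the number of the dip, with $n = 0$ denoting the solitary dip
if only one is present in the data.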



3 This statement is somewhat subtle, and we will discuss this matter in more detail in
section 3.2.3; for our current purposes it is sufficiently accurate.



We wish to estimate the ratio of the determinants of the Fisher information by an order
of magnitude estimate

$$\frac{J_{\rm short}}{J_{\rm long}} \sim \frac{\left. J^{S}_{i_1 j_1} J^{S}_{i_2 j_2} J^{S}_{i_3 j_3} J^{S}_{i_4 j_4} J^{S}_{i_5 j_5} \right|_{\max}}{\left. J^{L}_{i_1 j_1} J^{L}_{i_2 j_2} J^{L}_{i_3 j_3} J^{L}_{i_4 j_4} J^{L}_{i_5 j_5} \right|_{\max}}, \tag{27}$$

where both the numerator and the denominator are of the form (19), and according to the
argument above the maximal contributions come from the edge measurements. From (21)-(23) we
see that the derivatives with respect to $D$, $\eta$, $\tau$ and $b$ are all periodic at the
edges: $|\partial_{\theta_i} y(\theta, t_1)| = \ldots = |\partial_{\theta_i} y(\theta, t_4)|$
for $\theta_i \neq T$, and thus will cancel in the ratio (27).

It is crucial that $\partial_T y$, however, is not periodic due to the second term in (20).
At the first dip, $(t_{1,2} - \tau)/T = \mp \eta/(2T)$, we expand (20) to find

$$\partial_T y(t_1) \approx \partial_T y(t_2) \approx \frac{c D \pi^2 \eta^2}{48\, T^3}, \tag{28}$$

while at the second dip, $(t_{3,4} - \tau)/T = 1 \mp \eta/(2T)$, the contribution is

$$|\partial_T y(t_3)| \approx |\partial_T y(t_4)| \approx \frac{c D}{2 \eta}, \tag{29}$$

ignoring signs that are irrelevant for this estimate. Thus we see that the Fisher information
increases strongly as the second dip is included:

$$\frac{J_{\rm short}}{J_{\rm long}} \sim \frac{(\partial_T y(t_{1,2}))^2}{(\partial_T y(t_{1,2}))^2 + (\partial_T y(t_{1,2})\, \partial_T y(t_{3,4})) + (\partial_T y(t_{3,4}))^2} \sim \left(\frac{\eta}{T}\right)^6 \ll 1, \tag{30}$$

where we ignored order one coefficients. This is an explicit example of how our arguments
from section 2 work for a realistic model: when an experimental set-up does not have the
capability to detect two dips, it becomes impossible to determine the period, and consequently
the Fisher information is very small (or vanishing) compared to an experiment that is able
to detect two dips and determine the period more accurately. For any given experiment
of finite duration $\Delta t$, the Fisher information will decline with $T$ when
$T \gg \Delta t$, effectively rendering the parameter space compact.
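
The size of $\partial_T y$ at the two dips is easy to check numerically. The sketch below
evaluates the derivative (20), in the form we read it, at an edge of the first and of the
second dip and compares with the leading-order values $c D \pi^2 \eta^2 / (48 T^3)$ and
$c D / (2 \eta)$ that follow from expanding (20). All parameter values are illustrative.

```python
import math

def d_T_y(t, T, D, eta, tau, c=50.0):
    """Analytic derivative (20) of the light-curve with respect to the period T."""
    x = math.pi * (t - tau) / T
    t_bar = T * math.sin(x) / (math.pi * eta)
    f = math.tanh(c * (t_bar + 0.5)) ** 2 - math.tanh(c * (t_bar - 0.5)) ** 2
    return c * D * f / (2.0 * math.pi * eta) * (math.sin(x) - x * math.cos(x))

T, D, eta, tau, c = 100.0, 0.01, 1.0, 0.0, 50.0   # illustrative, eta / T = 0.01
first_dip = d_T_y(tau + 0.5 * eta, T, D, eta, tau, c)        # edge t_2
second_dip = d_T_y(tau + T - 0.5 * eta, T, D, eta, tau, c)   # edge t_3

predicted_first = c * D * math.pi**2 * eta**2 / (48.0 * T**3)
predicted_second = c * D / (2.0 * eta)
ratio_squared = (first_dip / second_dip) ** 2   # of order (eta / T)**6
```

For these values the squared ratio is below $(\eta/T)^6 = 10^{-12}$, so losing the second dip
costs many orders of magnitude in the period direction of the Fisher information.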

3.2.2 The tail $T \to \infty$

To verify our claim that the parameter space is really rendered compact we need to show
that $\det J \to 0$ strongly enough as $T$ is taken to infinity. It is easy enough to find the
$T$-scaling of the derivatives (20)-(23): $\partial_T y$ scales as $T^{-3}$, while the others
stay finite in the large $T$ limit. Thus, as seen from (19), the determinant will scale as

$$\sqrt{\det J} \sim \partial_T y \sim \frac{1}{T^3}, \tag{31}$$

which shows that the parameter space measure vanishes fast enough for large $T$ to
render the parameter space volume finite.
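
The $T^{-3}$ scaling can be made explicit by expanding the bracket in the $T$-derivative of
the light-curve for small argument. A short worked expansion, writing
$x = \pi (t - \tau)/T$ and assuming the measurement time $t$ is held fixed while $T$ grows:

```latex
\begin{align*}
\sin x - x \cos x
  &= \Big(x - \tfrac{x^3}{6} + \dots\Big) - x \Big(1 - \tfrac{x^2}{2} + \dots\Big)
   = \frac{x^3}{3} + O(x^5), \\
\partial_T y(\theta, t)
  &\approx \frac{c D f(t)}{2 \pi \eta} \cdot \frac{1}{3} \Big(\frac{\pi (t - \tau)}{T}\Big)^3
   = \frac{c D f(t)\, \pi^2 (t - \tau)^3}{6\, \eta\, T^3}.
\end{align*}
```

At the edge of a dip, $t - \tau = \pm \eta/2$ and $f = \pm 1$, this reduces to
$c D \pi^2 \eta^2 / (48 T^3)$.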



3.2.3 Explicit computation of $\mathrm{Det}(J_{ij})$ for specific experimental set-ups



While the order of magnitude estimate of the previous subsection offers an intuitive reason 
as to why the Fisher information decreases sharply when the number of peaks detected falls 
below two, it is still instructive to explicitly compute the determinant in a few experimental 
set-ups. 



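Detecting two dips: Let us first consider the case $J_{\rm long}$ from section 3.2, i.e.
measurements at times indicated in figure 1. Using the derivatives (20)-(23) one can write
down the Fisher information matrix (18), with rows and columns ordered as
$(T, D, \eta, \tau, b)$, as

$$J_{\rm long} = \frac{1}{\sigma^2} \begin{pmatrix}
2(T_1^2 + T_3^2) & -T_1 & 2 X T_1 & -4 X T_3 & 2 T_1 \\
-T_1 & 1 + n & -2X & 0 & -(2 + n) \\
2 X T_1 & -2X & 4 X^2 & 0 & 4X \\
-4 X T_3 & 0 & 0 & 16 X^2 & 0 \\
2 T_1 & -(2 + n) & 4X & 0 & 4 + n + m
\end{pmatrix}, \tag{32}$$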



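where for brevity we defined

$$T_1 = \partial_T y(t_1) = \partial_T y(t_2) = \frac{c D \pi^2 \eta^2}{48\, T^3}, \qquad T_3 = \partial_T y(t_3) = -\partial_T y(t_4) = \frac{c D}{2 \eta}, \tag{33}$$

$$X = -\frac{c D}{4 \eta} = \partial_\eta y(t_{1,2,3,4}) = -\frac{1}{2}\, \partial_\tau y(t_{1,3}) = \frac{1}{2}\, \partial_\tau y(t_{2,4}). \tag{34}$$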


In computing this matrix we used that $f(t) = 0$ for $\bar t \neq \pm\frac{1}{2}$, which is
true up to corrections of order $e^{-c}$, as seen from (24); for this reason one does not need
to specify the exact times of the $n$ measurements during the dip, or the $m$ measurements
outside the dip, as up to $e^{-c}$ corrections they all contribute equally. The determinant of
the Fisher information is simple,

$$\mathrm{Det}(J_{\rm long}) = \frac{64\, n m X^4}{\sigma^{10}} \left(T_1^2 + T_3^2\right) \approx \frac{64\, n m X^4}{\sigma^{10}}\, T_3^2. \tag{35}$$

This result explains the subtlety referred to earlier: although measurements at the edges 
contribute the most to the Fisher information, if one only has measurements at the edges 
(n = m = 0) the Fisher information actually vanishes. Physically this is easy to interpret, 
as only measuring the edges $t_1, \ldots, t_4$ will yield four points lying on a line, and thus they
cannot be used to determine any information about the curve; other data points are needed 
to 'anchor' the data. 



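Detecting only one dip: Similarly one can compute the Fisher information in the 'short'
experimental set-up, where measurements are made at the same times as before, except not
at $t_3$ and $t_4$. With rows and columns again ordered as $(T, D, \eta, \tau, b)$, this
yields

$$J_{\rm short} = \frac{1}{\sigma^2} \begin{pmatrix}
2 T_1^2 & -T_1 & 2 X T_1 & 0 & 2 T_1 \\
-T_1 & \tfrac{1}{2} + n & -X & 0 & -(1 + n) \\
2 X T_1 & -X & 2 X^2 & 0 & 2X \\
0 & 0 & 0 & 8 X^2 & 0 \\
2 T_1 & -(1 + n) & 2X & 0 & 2 + n + m
\end{pmatrix}, \tag{36}$$

and perhaps surprisingly the determinant vanishes: $\mathrm{Det}(J_{\rm short}) = 0$, up to
tiny $e^{-c}$ corrections, since the first row is $T_1 / X$ times the third. This indicates
that the estimate in section 3.2.1 was an overestimate: terms in the determinant of
$J_{\rm short}$ are of the magnitude estimated, but the determinant is arranged in such a way
that the terms cancel to a high accuracy, and the compactness of the parameter space is
strengthened.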
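
Both determinants can be cross-checked with exact rational arithmetic. The sketch below
builds the Fisher information (with $\sigma = 1$) from the gradient vectors
$(\partial_T y, \partial_D y, \partial_\eta y, \partial_\tau y, \partial_b y)$ at each type of
measurement, using the edge values as we read them from the definitions of $T_1$, $T_3$ and
$X$, and neglecting $e^{-c}$ corrections. The numerical values of $T_1$, $T_3$, $X$, $n$ and
$m$ are arbitrary placeholders.

```python
from fractions import Fraction as F

def det(matrix):
    """Determinant by exact Gaussian elimination on Fractions."""
    a = [row[:] for row in matrix]
    size = len(a)
    result = F(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if a[r][col] != 0), None)
        if pivot is None:
            return F(0)          # no pivot in this column: singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result     # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for k in range(col, size):
                a[r][k] -= factor * a[col][k]
    return result

def fisher(rows_with_counts):
    """J_ij = sum_k v_k[i] v_k[j], each gradient row repeated `count` times."""
    J = [[F(0)] * 5 for _ in range(5)]
    for v, count in rows_with_counts:
        for i in range(5):
            for j in range(5):
                J[i][j] += count * v[i] * v[j]
    return J

T1, T3, X = F(1, 1000), F(3), F(-2)   # arbitrary placeholder values
n, m = 4, 7                           # points inside / outside the dip
half = F(1, 2)
first_dip = [([T1, -half, X, -2 * X, 1], 1), ([T1, -half, X, 2 * X, 1], 1)]
second_dip = [([T3, -half, X, -2 * X, 1], 1), ([-T3, -half, X, 2 * X, 1], 1)]
bulk = [([0, -1, 0, 0, 1], n), ([0, 0, 0, 0, 1], m)]

det_long = det(fisher(first_dip + second_dip + bulk))
det_short = det(fisher(first_dip + bulk))
```

With these inputs `det_long` equals $64\, n m X^4 (T_1^2 + T_3^2)$ exactly and `det_short`
is exactly zero. Setting `n` or `m` to zero also makes `det_long` vanish, which is the
'anchoring' subtlety noted above.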

4 Discussion 

Our analysis has shown how the specification of an experimental design affects the measure
on model parameter spaces in MDL model selection (or equivalently the prior probability
distribution on parameters in the Bayesian approach). Interestingly, the finite number of
measurements within a bounded sample space in any practical experiment can effectively
render a non-compact parameter space compact, thereby leading to a well-defined prior
distribution (1). Our analysis could be turned around to design experiments to discriminate
well between models in some chosen region of the parameter space by ensuring that the Fisher
information (18) is large in the desired region. It would also be useful to determine general
conditions under which experimental design effectively makes model parameter spaces
compact, perhaps following the arguments of Sec. 2.